Lecture 7: August 21st, 2023

Check-in Question: Happy Monday! How was your weekend? For those in SoCal, were you able to get through the storm okay? Reach out to Yasmeen if there are any issues as a result of the storm (e.g. power outage).

Updates and Reminders

EDA Outcome Quizzes:

  • Try last week’s EDA outcome quizzes by midnight tonight (if you haven’t already). I’ll unlock more attempts for those who are missing them after this first round closes.

  • We’re ever-so-slightly behind where I thought we’d be, but I’d still like to make this the EDA checkpoint week. What that means for us:

    • There are four (4) EDA outcomes we haven’t seen yet. I will release quizzes for these outcomes by the end of the day, and give everyone four (4) attempts for each.

    • What does “checkpoint week” mean? It simply means I won’t be opening new quizzes for EDA outcomes after this week. If you are missing an EDA outcome that you want after the attempts this week, reach out to me so we can see what you’re missing and work to fill in any gaps. I would then recommend submitting an SLO revision form.

Today:

  • I want to show you some really cool interactive features of Altair. We’ll spend some time working through examples, as well as leafing through the documentation.

  • The interactive Altair features are how I’d like to round out our EDA Unit 3 material. We’ll try to start EDA Unit 4 today and finish it on Wednesday. I expect EDA Unit 4 to be a little bit shorter (and less exciting) than our previous units. The point will be to review some important Python concepts unrelated to data science, before we formally begin our ML Units (exciting!).

  • Optimist me wants to say we’ll start ML on Wednesday, but most likely we’ll really begin on Friday.

We’ll start today’s lecture by going through the section on Multi-view plots in Altair. Other than creating the distinct rows, treat the creation of the chart as a warm-up.

Multi-view plots in Altair

import altair as alt

Extremely important for interactive portions: let’s make sure we’re on Altair version 5.0.0 (at least).

alt.__version__
'5.0.1'
import seaborn as sns
df = sns.load_dataset("mpg")
  • Make a facet chart using “horsepower” for the x-coordinate, “mpg” for the y-coordinate, “cylinders” for the color with the Nominal data encoding type, and dividing the data according to the number of cylinders. Put each chart in its own row.

df.head()
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
#skeleton of chart without divisions
alt.Chart(df).mark_point().encode(
    #default encoding is Q
    x="horsepower",
    y="mpg",
    #:N means nominal encoding
    color="cylinders:N",
    shape="cylinders:N"
)

Recall: Altair recognizes these two types of categorical data: Nominal (without a natural ordering), and Ordinal (with a natural ordering).

Note: Just because data has a natural ordering doesn’t mean we need to use it. Notice that the number of cylinders is ordered, but we can still tell Altair to treat it as Nominal.

#now with divisions along number of cylinders - vertical stacking
alt.Chart(df).mark_point().encode(
    #default encoding is Q
    x="horsepower",
    y="mpg",
    #:N means nominal encoding
    color="cylinders:N",
    shape="cylinders:N",
    row="cylinders:N"
)

Question: What would be a benefit of stacking the data like this?

Brainstorming:

  • One benefit of stacking the data like this is that we can compare values from within a certain group (in this case number of cylinders).

  • Notice that since we’ve stacked these charts vertically, it’s easy to compare the difference in horsepower between cylinders by drawing a vertical line from top to bottom.

  • What could I do if I wanted to stack the charts horizontally to compare mpg across cylinders?

#now with divisions along number of cylinders - horizontal stacking
alt.Chart(df).mark_point().encode(
    #default encoding is Q
    x="horsepower",
    y="mpg",
    #:N means nominal encoding
    color="cylinders:N",
    shape="cylinders:N",
    column="cylinders:N"
)

Interactive charts in Altair - mpg dataset

Interactive charts are one of the coolest parts of Altair! We’ve already seen a little bit of interactivity by including a tooltip. Let’s see a few more examples today. We can get some inspiration by checking out this link. Today, we will incorporate ideas from the following:

  • Interactive rectangular brush

  • Selection Histogram

  • Import Altair and check that it is at least version 5.

alt.__version__
'5.0.1'
  • Import the mpg dataset from Seaborn and save it with the name df.

df.head()
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
  • Make a chart with “horsepower” along the x-axis, “weight” along the y-axis, and color/shape representing “origin”. Call this chart c1.

c1 = alt.Chart(df).mark_point().encode(
    x="horsepower",
    y="weight",
    color="origin:N",
    shape="origin:N"
)

c1
  • Following the Interactive Rectangular Brush example, add a selection interval to c1.

brush = alt.selection_interval()

c1 = alt.Chart(df).mark_point().encode(
    x="horsepower",
    y="weight",
    color="origin:N",
    shape="origin:N"
).add_params(brush)

c1

The chart looks exactly the same so far! How do I check? Drag a region over the chart :)

Yay! It’s working! But…doesn’t really do anything yet…

Spoiler: The purpose of this square will be to pass data to another chart.

  • Again using the documentation, make the selection so that it grays out everything that’s not in the box.

We’re going to use alt.condition() for the color. This tells us what to do if a condition is true, and what to do if a condition is false. This should remind you of np.where from NumPy.
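
The comparison with np.where can be made concrete with a small NumPy sketch. The values below are made up, and the Boolean array `inside` is only a stand-in for "this point is inside the brush"; it is not how Altair evaluates the selection internally.

```python
import numpy as np

# Made-up weights and origins for four cars (illustration only).
weights = np.array([3504, 2130, 4341, 1985])
origins = np.array(["usa", "japan", "usa", "europe"])

# Stand-in for the brush: True where the point falls inside the selected range.
inside = (weights >= 2000) & (weights <= 4000)

# If the condition holds, keep the origin (colored); otherwise use grey.
colors = np.where(inside, origins, "grey")
print(colors)
```

alt.condition(brush, "origin:N", alt.value('grey')) has the same three-part shape: a condition, what to use when it is true, and what to use when it is false.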

brush = alt.selection_interval()

c1 = alt.Chart(df).mark_point().encode(
    x="horsepower",
    y="weight",
    color=alt.condition(brush,"origin:N", alt.value('grey')),
    shape="origin:N"
).add_params(brush)

c1

What if I wanted to specify the color scheme for “origin” as well?

brush = alt.selection_interval()

c1 = alt.Chart(df).mark_point().encode(
    x="horsepower",
    y="weight",
    color=alt.condition(brush, alt.Color("origin:N").scale(scheme="turbo"), alt.value('grey')),
    shape="origin:N"
).add_params(brush)

c1

alt.Color(...).scale(...) lets us set properties of the color scale (in this case, the scheme).

Motivation: We’ll now construct a bar chart that depends on our selection from c1.

  • Make a bar chart with “origin” along the x-axis, and “count()” along the y-axis. Call this chart c2.

df.columns
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')

Recall: "count()" is an aggregation shorthand that Altair understands; it is not a column of df. In this case, it will count the rows corresponding to each place of origin.

c2 = alt.Chart(df).mark_bar().encode(
    x="origin:N",
    y="count()"
)

c2

How could I check these values using value_counts()?

df["origin"].value_counts()
usa       249
japan      79
europe     70
Name: origin, dtype: int64

Question from the chat: Can we swap xy-axes here?

c3 = alt.Chart(df).mark_bar().encode(
    x="count()",
    y="origin:N"
).transform_filter(brush)
  • Following the example from Selection Histogram, use .transform_filter(brush) to tell Altair to change c2 depending on our selection from c1.

c2 = alt.Chart(df).mark_bar().encode(
    x="origin:N",
    y="count()"
).transform_filter(brush)

#If we try to view c2 on its own now, there will be an error, because brush is only attached to c1; c2 needs to be displayed together with c1.
  • Display c1 and c2 side-by-side by calling c1 | c2.

c1 | c2
c1 & c3

Our charts should be looking pretty good right now, but there are a few things I think we could improve:

  • Fix the y-axis on c2 so that it’s not changing with every new selection.

  • Similar for the x-axis.

From the value_counts() output for origin above, I know that the largest count is 249 (usa), so I’ll round up and use 250 as the top of the y-axis domain.

c2 = alt.Chart(df).mark_bar().encode(
    x=alt.X("origin:N").scale(domain=df["origin"].unique()),
    y=alt.Y("count()").scale(domain=(0,250))
).transform_filter(brush)
c1 | c2
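
If you would rather not hard-code 250, one option is to compute the domain from the counts. This is a minimal sketch: the counts are copied from the value_counts() output above, while the rounding step of 50 and the variable names are assumptions chosen for illustration.

```python
import math
import pandas as pd

# Counts copied from the value_counts() output for "origin".
counts = pd.Series({"usa": 249, "japan": 79, "europe": 70})

# Round the largest count up to the next multiple of `step` (an arbitrary choice).
step = 50
upper = math.ceil(counts.max() / step) * step

y_domain = (0, upper)
x_domain = sorted(counts.index)  # fixed list of categories for the x-axis

print(y_domain, x_domain)
```

These values could then be passed to .scale(domain=...) on the y and x encodings in the same way as in the cell above.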

Interactive charts in Altair: Spotify dataset

  • Import the attached Spotify dataset as df. In this csv file, missing values are denoted by a blank space. Use the na_values keyword argument with pd.read_csv so that those blank spaces get converted to np.nan.

import pandas as pd
df = pd.read_csv("spotify.csv", na_values=" ")
df.head()
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.800 -4.808 0.0504 0.1270 0.3590 134.002 211560.0 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.0383 0.1030 169.928 141806.0 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.1540 0.3350 0.0849 166.928 178147.0 0.688 A
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.808 0.897 -3.712 0.0348 0.0469 0.3640 126.026 231041.0 0.591 B
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.736 0.704 -7.409 0.0615 0.0203 0.0501 149.995 212000.0 0.894 D#/Eb

5 rows × 23 columns

  • Check your work by evaluating value_counts on df.dtypes. If everything worked correctly, there should be 11 float columns, 3 integer columns, and 9 object columns.

df.dtypes.value_counts()
float64    11
object      9
int64       3
dtype: int64
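
To see why the na_values argument matters, here is a small sketch that reads the same made-up CSV text twice. The column names and values are invented for illustration; only the pattern (a single blank space marking a missing value) mirrors the Spotify file.

```python
import io
import pandas as pd

# Made-up CSV text where a missing Tempo is written as a single blank space.
csv_text = "Song,Tempo\nA,120.5\nB, \nC,98.0\n"

# Without na_values, the blank space is kept as text, so the column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the blank space becomes NaN and the column is parsed as floats.
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw["Tempo"].tolist())
print(clean["Tempo"].tolist())
```

This is the reason for the dtype check above: numeric columns that come out as object usually mean some non-numeric marker was not recognized as missing.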
  • Plot the data from df using Altair. Encode the “Acousticness” data as the x-coordinate, the “Energy” data as the y-coordinate, and encode the “Valence” data as the color.

df.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
       'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
       'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
       'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
       'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
       'Chord'],
      dtype='object')
df.shape
(1556, 23)
alt.Chart(df).mark_circle().encode(
    x="Acousticness",
    y="Energy",
    color="Valence"
)
  • Adjust the color scheme used to the dark2 color scheme.

alt.Chart(df).mark_circle().encode(
    x="Acousticness",
    y="Energy",
    color=alt.Color("Valence").scale(scheme="dark2")
)

This color scheme still doesn’t look too good to me (some colors look very similar, even though they are on completely different sides of the spectrum)…let’s find a better one.

  • Change the color scheme from dark2 to a different one of these options (scroll down to find the options).

alt.Chart(df).mark_circle().encode(
    x="Acousticness",
    y="Energy",
    color=alt.Color("Valence").scale(scheme="spectral")
)
  • Add a tooltip to the chart, indicating the Artist name and the song name.

alt.Chart(df).mark_circle().encode(
    x="Acousticness",
    y="Energy",
    color=alt.Color("Valence").scale(scheme="spectral"),
    tooltip=["Artist","Song Name"]
)

A chart with the 50 most frequently occurring artists

  • Define a new variable s containing the pandas Series corresponding to the “Artist” column in df.

s = df["Artist"]
  • Call the value_counts method on s.

s.value_counts()
Taylor Swift                     52
Lil Uzi Vert                     32
Justin Bieber                    32
Juice WRLD                       30
Pop Smoke                        29
                                 ..
Chris Brown, Young Thug           1
Rauw Alejandro, J Balvin          1
347aidan                          1
Migrantes, Alico                  1
Dadá Boladão, Tati Zaqui, OIK     1
Name: Artist, Length: 716, dtype: int64
  • Using the previous result, find the 50 most frequently occurring artists in this dataset. (Note: the value_counts method automatically sorts the results from most frequent to least frequent.)

s.value_counts()[:50]
Taylor Swift               52
Lil Uzi Vert               32
Justin Bieber              32
Juice WRLD                 30
Pop Smoke                  29
BTS                        29
Bad Bunny                  28
Eminem                     22
The Weeknd                 21
Ariana Grande              20
Drake                      19
Billie Eilish              18
Selena Gomez               17
J. Cole                    16
Doja Cat                   16
Dua Lipa                   15
Lady Gaga                  14
Tyler, The Creator         14
DaBaby                     14
21 Savage, Metro Boomin    12
Olivia Rodrigo             12
Kid Cudi                   12
Mac Miller                 11
Polo G                     11
Lil Baby                   10
Post Malone                10
Sam Smith                   9
BLACKPINK                   9
The Kid LAROI               9
J Balvin                    9
Travis Scott                9
Ed Sheeran                  9
Joji                        8
Apache 207                  7
XXXTENTACION                7
Morgan Wallen               7
Megan Thee Stallion         6
Maluma                      6
Lil Nas X                   6
Ava Max                     6
Miley Cyrus                 6
Machine Gun Kelly           6
Rauw Alejandro              6
Bonez MC                    6
Migos                       6
5 Seconds of Summer         5
Anuel AA                    5
Shawn Mendes                5
Lauv                        5
Jonas Brothers              5
Name: Artist, dtype: int64
  • Define a variable top_artists which contains these top 50 artists. (Hint. You might want to use the index attribute.)

I want to get the Artist names out of this series…

type(s.value_counts()[:50])
pandas.core.series.Series

Since the artist names are the indices of this series, we could use the index attribute of a pandas Series object.

top_artists = s.value_counts()[:50].index
top_artists
Index(['Taylor Swift', 'Lil Uzi Vert', 'Justin Bieber', 'Juice WRLD',
       'Pop Smoke', 'BTS', 'Bad Bunny', 'Eminem', 'The Weeknd',
       'Ariana Grande', 'Drake', 'Billie Eilish', 'Selena Gomez', 'J. Cole',
       'Doja Cat', 'Dua Lipa', 'Lady Gaga', 'Tyler, The Creator', 'DaBaby',
       '21 Savage, Metro Boomin', 'Olivia Rodrigo', 'Kid Cudi', 'Mac Miller',
       'Polo G', 'Lil Baby', 'Post Malone', 'Sam Smith', 'BLACKPINK',
       'The Kid LAROI', 'J Balvin', 'Travis Scott', 'Ed Sheeran', 'Joji',
       'Apache 207', 'XXXTENTACION', 'Morgan Wallen', 'Megan Thee Stallion',
       'Maluma', 'Lil Nas X', 'Ava Max', 'Miley Cyrus', 'Machine Gun Kelly',
       'Rauw Alejandro', 'Bonez MC', 'Migos', '5 Seconds of Summer',
       'Anuel AA', 'Shawn Mendes', 'Lauv', 'Jonas Brothers'],
      dtype='object')

Question from the chat: Could we try casting this to a list?

Notice that this returns the values (the counts), not the index labels (the artist names).

list(s.value_counts()[:50])
[52,
 32,
 32,
 30,
 29,
 29,
 28,
 22,
 21,
 20,
 19,
 18,
 17,
 16,
 16,
 15,
 14,
 14,
 14,
 12,
 12,
 12,
 11,
 11,
 10,
 10,
 9,
 9,
 9,
 9,
 9,
 9,
 8,
 7,
 7,
 7,
 6,
 6,
 6,
 6,
 6,
 6,
 6,
 6,
 6,
 5,
 5,
 5,
 5,
 5]

But! We could try converting to a dictionary

dict(s.value_counts()[:50]).keys()
dict_keys(['Taylor Swift', 'Lil Uzi Vert', 'Justin Bieber', 'Juice WRLD', 'Pop Smoke', 'BTS', 'Bad Bunny', 'Eminem', 'The Weeknd', 'Ariana Grande', 'Drake', 'Billie Eilish', 'Selena Gomez', 'J. Cole', 'Doja Cat', 'Dua Lipa', 'Lady Gaga', 'Tyler, The Creator', 'DaBaby', '21 Savage, Metro Boomin', 'Olivia Rodrigo', 'Kid Cudi', 'Mac Miller', 'Polo G', 'Lil Baby', 'Post Malone', 'Sam Smith', 'BLACKPINK', 'The Kid LAROI', 'J Balvin', 'Travis Scott', 'Ed Sheeran', 'Joji', 'Apache 207', 'XXXTENTACION', 'Morgan Wallen', 'Megan Thee Stallion', 'Maluma', 'Lil Nas X', 'Ava Max', 'Miley Cyrus', 'Machine Gun Kelly', 'Rauw Alejandro', 'Bonez MC', 'Migos', '5 Seconds of Summer', 'Anuel AA', 'Shawn Mendes', 'Lauv', 'Jonas Brothers'])
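
A more direct route to a plain Python list of names is to convert the index itself. The sketch below uses a tiny made-up Series (and .head(2) in place of the slice [:50]) so the result is easy to check by eye.

```python
import pandas as pd

# Tiny made-up stand-in for the "Artist" column.
s = pd.Series(["A", "A", "A", "B", "B", "C"], name="Artist")

top = s.value_counts().head(2)

# Casting the Series to a list gives the counts...
counts_only = list(top)

# ...while the index holds the artist names.
names = top.index.tolist()

print(counts_only, names)
```

For the isin step that follows, either the Index object or the list works.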
  • (More difficult.) Use the isin method (documentation) and Boolean indexing to define a new pandas DataFrame df2 which is the sub-DataFrame of df containing only the 50 most frequently occurring artists.

df2 = df[df["Artist"].isin(top_artists)]
df2
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.03830 0.1030 169.928 141806.0 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.808 0.897 -3.712 0.0348 0.04690 0.3640 126.026 231041.0 0.591 B
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.736 0.704 -7.409 0.0615 0.02030 0.0501 149.995 212000.0 0.894 D#/Eb
5 6 1 18 2021-05-07--2021-05-14 MONTERO (Call Me By Your Name) 30,071,134 Lil Nas X 5473565.0 67BtfxlNbhBmCDR2L2l8qd ['lgbtq+ hip hop', 'pop rap'] ... 0.610 0.508 -6.682 0.1520 0.29700 0.3840 178.818 137876.0 0.758 G#/Ab
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1545 1546 128 1 2019-12-27--2020-01-03 Candy 5,632,102 Doja Cat 8671649.0 1VJwtWR6z7SpZRwipI12be ['dance pop', 'pop'] ... 0.689 0.516 -5.857 0.0444 0.51300 0.1630 124.876 190920.0 0.209 G#/Ab
1548 1549 178 1 2019-12-27--2020-01-03 Old Town Road 4,852,004 Lil Nas X 5488666.0 2YpeDb67231RjR0MgVLzsG ['lgbtq+ hip hop', 'pop rap'] ... 0.878 0.619 -5.560 0.1020 0.05330 0.1130 136.041 157067.0 0.639 F#/Gb
1549 1550 187 1 2019-12-27--2020-01-03 Let Me Know (I Wonder Why Freestyle) 4,701,532 Juice WRLD 19102888.0 3wwo0bJvDSorOpNfzEkfXx ['chicago rap', 'melodic rap'] ... 0.635 0.537 -7.895 0.0832 0.17200 0.4180 125.028 215381.0 0.383 G
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.762 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.448 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G

678 rows × 23 columns

  • Check your answer: the shape of df2 should be 678 by 23.

df2.shape
(678, 23)
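
The isin step is easier to follow on a small example where the Boolean Series can be read directly. This is a sketch on made-up data; the column values and the choice of "top 2" are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the Spotify DataFrame.
toy = pd.DataFrame({
    "Artist": ["A", "B", "A", "C", "B", "A"],
    "Energy": [0.8, 0.6, 0.7, 0.5, 0.9, 0.4],
})

# The two most frequent artists: A (3 rows) and B (2 rows).
top_artists = toy["Artist"].value_counts().head(2).index

# isin gives one True/False per row; Boolean indexing keeps the True rows.
mask = toy["Artist"].isin(top_artists)
toy2 = toy[mask]

print(mask.tolist())
print(toy2.shape)
```

Only the row for artist C is dropped, and the original row labels are kept, which is why the index of df2 above has gaps.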

Interactive Altair Chart

Here, we create an interactive chart to go along with df2 that we just made above.

  • Make the same chart as you made above, with the only difference being, that you now use df2 instead of df for the data.

alt.Chart(df2).mark_circle().encode(
    x="Acousticness",
    y="Energy",
    color=alt.Color("Valence").scale(scheme="spectral"),
    tooltip=["Artist","Song Name"]
)
  • Add a selection_interval object named brush to the chart.

brush = alt.selection_interval()
alt.Chart(df2).mark_circle().encode(
    x="Acousticness",
    y="Energy",
    color=alt.Color("Valence").scale(scheme="spectral"),
    tooltip=["Artist","Song Name"]
).add_params(brush)
  • Assign this chart to the variable name c1 using the code c1 = alt.Chart....

brush = alt.selection_interval()
c1 = alt.Chart(df2).mark_circle().encode(
    x="Acousticness",
    y="Energy",
    color=alt.Color("Valence").scale(scheme="spectral"),
    tooltip=["Artist","Song Name"]
).add_params(brush)
  • Display this chart by evaluating c1.

c1
  • Check your work: if you click and drag on the chart, there should be a grey rectangle that appears. (Once you’ve displayed the grey rectangle, you can move it around.)

  • Make a second chart c2 showing a bar chart for the selected data as in the previous part of lecture. The x-axis should correspond to Artist names (only the top 50 since we’re using df2) and the y-axis should correspond to the number of times those artists appear in the selection. (Use transform_filter with brush, as in the above notes.)

c2 = alt.Chart(df2).mark_bar().encode(
    x="Artist",
    y="count()"
).transform_filter(brush)
  • Display c1 and c2, one after the other, using c1&c2. (If you instead want them to appear side-by-side, you can use c1|c2.)

c1&c2
  • Find an image you like (including the selection) and save it using the … “Save as PNG” from the top right of the Deepnote cell with the two charts.

  • Upload that file to this Deepnote project, and embed that png file in a markdown cell. The syntax is ![alt text](path).
